Skip to content

Fix stale goroutine label, validate UTF-8 on label strings#1454

Merged
christos68k merged 1 commit into
open-telemetry:mainfrom
parca-dev:fix-stale-go-label
Jun 2, 2026
Merged

Fix stale goroutine label, validate UTF-8 on label strings#1454
christos68k merged 1 commit into
open-telemetry:mainfrom
parca-dev:fix-stale-go-label

Conversation

@gnurizen
Copy link
Copy Markdown
Contributor

@gnurizen gnurizen commented May 22, 2026

Summary

This regressed in 40dd9d9 ("ebpf: simplify get_pristine_per_cpu_record" #1091), which removed the explicit zeroing of trace->custom_labels.labels on the assumption that the slots were only read after a successful write that fully populated them. The slots are populated via bpf_probe_read_user, which only writes the bytes it reads, so a key/value shorter than one previously written to the same per-CPU slot inherits trailing bytes from the prior trace and produces a corrupted label.

  • eBPF side: write a single NUL after each bpf_probe_read_user.
  • Go side: validate label keys and values as UTF-8, which OTLP/pprof require. Strictness is deliberately asymmetric:
    • Keys are strict. Any invalid byte (including a single split-rune continuation at the end) drops the whole label. A corrupted key would silently group unrelated samples under a garbage name, which is worse than dropping.
    • Values are lenient. Fixed-width eBPF buffers can clip a multi-byte rune in half at the buffer boundary; we salvage the longest valid UTF-8 prefix rather than discard the label, so a clipped request_id or customer_name still arrives with everything up to the broken rune intact.
  • Two new metrics (IDGoLabelsDroppedInvalidName, IDGoLabelsDroppedInvalidValue) count labels dropped for each reason.
  • comm is left as best-effort: kernel-supplied, almost always ASCII, no useful fallback if it isn't.

Supersedes #1453.

@gnurizen gnurizen force-pushed the fix-stale-go-label branch 3 times, most recently from 108290a to b1d5bac Compare May 22, 2026 13:04
@gnurizen gnurizen marked this pull request as ready for review May 22, 2026 13:09
@gnurizen gnurizen requested review from a team as code owners May 22, 2026 13:09
@gnurizen gnurizen force-pushed the fix-stale-go-label branch from 3a11325 to 95456a8 Compare May 26, 2026 12:26
Comment thread tracer/tracer.go Outdated
Comment thread tracer/tracer.go Outdated
Comment thread support/ebpf/go_labels.ebpf.c Outdated
Comment thread tracer/tracer.go Outdated
@gnurizen gnurizen force-pushed the fix-stale-go-label branch from 3d342e2 to 53d7937 Compare May 27, 2026 12:32
@gnurizen
Copy link
Copy Markdown
Contributor Author

I (hopefully) addressed all the review comments, RFAL @open-telemetry/ebpf-profiler-approvers.

Comment thread support/ebpf/go_labels.ebpf.c Outdated
Comment thread support/ebpf/go_labels.ebpf.c Outdated
Comment thread support/ebpf/go_labels.ebpf.c Outdated
Comment thread support/ebpf/go_labels.ebpf.c Outdated
Comment thread support/ebpf/go_labels.ebpf.c Outdated
Comment thread tracer/tracer.go Outdated
@gnurizen gnurizen force-pushed the fix-stale-go-label branch from 3abd1f9 to aa5f4b7 Compare June 1, 2026 10:58
@gnurizen
Copy link
Copy Markdown
Contributor Author

gnurizen commented Jun 1, 2026

Review feedback addressed, RFAL, thanks @florianl and @fabled

@gnurizen gnurizen force-pushed the fix-stale-go-label branch from aa5f4b7 to f625eea Compare June 1, 2026 11:56
This regressed in 40dd9d9 ("ebpf: simplify
get_pristine_per_cpu_record" open-telemetry#1091), which removed an explicit zeroing of
trace->custom_labels.labels on the assumption that the slots were only
read after a successful write that fully populated them. The slots are
populated via bpf_probe_read_user, which only writes the bytes it reads,
so a key/value shorter than one previously written to the same per-CPU
slot inherits the trailing bytes from the prior trace and produces a
corrupted label.

The eBPF side now writes a single NUL after each bpf_probe_read_user;
userspace stops at the first NUL, so one terminator is sufficient. On
the Go side, label keys and values are validated as UTF-8 (required by
OTLP/pprof), with deliberately asymmetric strictness:

  - Keys are strict: any invalid byte (including a single split-rune
    continuation at the end) drops the whole label. A corrupted key
    would silently group unrelated samples under a garbage name.
  - Values are lenient: on fixed-width truncation that splits a
    multi-byte rune we salvage the longest valid UTF-8 prefix rather
    than drop the label, so a clipped request_id or customer_name still
    arrives with everything up to the broken rune intact.

Two metrics count labels dropped for each reason. Comm is left as
best-effort: it's kernel-supplied, almost always ASCII, and we don't
have a useful fallback if it isn't.
@gnurizen gnurizen force-pushed the fix-stale-go-label branch from f625eea to 8f88019 Compare June 2, 2026 14:06
@christos68k christos68k merged commit ec44063 into open-telemetry:main Jun 2, 2026
32 checks passed
gnurizen added a commit to parca-dev/opentelemetry-ebpf-profiler that referenced this pull request Jun 2, 2026
tracer: align go labels validation with upstream PR open-telemetry#1454
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants